ANGSD Project: Mohith Arikatla (moa4020)

Index

  1. Introduction
  2. Results
  3. Methods
  4. Discussion
  5. Key datasets used in this study

Introduction

A brief paragraph summarizing the scientific background and/or why that particular question is interesting (you should cite at least 3-5 papers). Of course, you should clearly state the question you're going to investigate as well as the specific hypothesis you set out to test.

RNA sequencing (RNA-seq) is widely used to profile transcriptomes and identify differentially expressed genes (DEGs) between conditions. The alignment algorithm used in RNA-seq data analysis can affect the results of downstream analysis, including pathway enrichment analysis.

Salmon and STAR are two widely used alignment algorithms for RNA-seq data. The specific question to investigate is whether the use of Salmon and STAR alignment algorithms results in different pathways being enriched. This question is interesting because pathway enrichment analysis is often used to identify biological processes involved in specific conditions, and differences in pathway analysis based on the alignment algorithm used could impact the interpretation of results.

Several studies have compared the performance of different RNA-seq alignment algorithms, including Salmon and STAR, and have shown that the choice of alignment algorithm can affect the results of downstream analysis. For example, a study by Hu et al. (2019) compared the performance of different RNA-seq alignment algorithms and showed that the choice of alignment algorithm can significantly affect the identification of DEGs. Another study by Xu et al. (2020) evaluated the performance of different RNA-seq alignment algorithms for quantifying gene expression and showed that Salmon performed better than STAR for some metrics. A third study by Tarazona et al. (2015) compared different RNA-seq analysis pipelines and showed that different pipelines can lead to different results in terms of DEG identification and pathway enrichment analysis.

Overall, the hypothesis is that applying Salmon and STAR alignment algorithms will result in different pathways being enriched.

Results

Another brief paragraph summarizing your key insights and possible future experiments/analyses that might enhance your own analysis. Make sure to include a discussion of the limits that your data set has!

  1. There appears to be no significant difference between the end results of both the algorithms when considering this particular functionality (Differential Expression Analysis).
  2. This might be limited to well studied genomes. More research needs to be done on organisms that have not been indexed accurately.
  3. Conducting Gene Ontology assessment requires prior knowledge about the over-represented pathways that pop-up in certain genomes often, making a note of these GO terms while interpreting the results of GO enrichment will keep one from being misled due to these biases.
  4. Further well thought out GO analysis should be conducted on this dataset.

Limitations:

  1. The authors did not disclose what kind of pathway analysis they’ve performed in the paper. This lead to a lot of confusion due to a mismatch between my results and the paper’s, post-EnrichKEGG.
  2. The authors also did not mention dealing with over-represented pathways based on the gene-ratio bias.
  3. The RNA-Seq analysis for this paper was delegated to the bioinformatics-core in Harvard explaining why the finer details for this step were not provided.

Methods

A detailed verbose description of all the steps you took to arrive at the conclusion including how and where the data was downloaded, pre-processed and analyzed. This should also include some brief reasoning of why you chose certain tools/solutions and what the results of the QC tell you about the data at hand.

  1. Downloading Data & Basic QC
  2. [Downloading the .fastq files]
  3. Downloading the .fasta file for mm10 genome
  4. Pre-alignment Basic QC
  5. STAR Indexing & Alignment
  6. STAR Post-alignment MultiQC
  7. Downloading and Installing Salmon
  8. Salmon Post-Alignment Multiqc
  9. DESeq
  10. KEGG Pathway Enrichment
  11. [List of genes mentioned to be differentially expressed]

Details about the project files

  • What publication is it linked to?

    Ans: The cited article is the source for this dataset:

    Li, S., & Jakobs, T. C. (2022). Secreted phosphoprotein 1 slows neurodegeneration and rescues visual function in mouse models of aging and glaucoma. Cell reports, 41(13), 111880. https://doi.org/10.1016/j.celrep.2022.111880

  • Who generated the data?

    Ans: The authors Song Li and Tatjana C Jakobs produced the data. They are affiliated to the Department of Ophthalmology, Harvard Medical School, Boston.

  • Where can the dataset be found?

    Ans: On the Gene Expression Omnibus server: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE174522

  • How was the RNA extracted?

    Ans: RNA was extracted using the RNeasy Plus Micro Kit (Qiagen; 74034)

  • What library prep was used?

    Ans: mRNA profiles were generated by deep sequencing using Illumina NovaSeq 6000

  • What cell type was used?

    Ans: Astrocytes isolated from either C57BL/6 wild-type or Spp1 KO neonatal mice were used

  • What was the treatment/experimental condition?

    Ans: The experimental condition was the knockout of Spp1

  • What sequencing platform was used?

    Ans: Illumina NovaSeq 6000 was used for sequencing

Downloading Data & Basic QC

Downloading the .fastq files

I have used the following script to download the fastq files from the .tsv file I’ve obtained from ENA. It was parsed to obtain the .fastq ftp links along-side the run accession number. The file contains the following links that have been parsed from the .tsv file from this link:

$ wget "ftp://https://www.ebi.ac.uk/ena/portal/api/filereport?accession=PRJNA730426&result=read_run&fields=run_accession,fastq_ftp,sample_title&format=tsv&download=true&limit=0" >> PRJNA730426.tsv

$ cat PRJNA730426.tsv
sample_accession        run_accession   fastq_ftp       sample_title
SAMN19231497    SRR14563232     ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/032/SRR14563232/SRR14563232_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/032/SRR14563232/SRR14563232_2.fastq.gz    wt_1
SAMN19231496    SRR14563233     ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/033/SRR14563233/SRR14563233_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/033/SRR14563233/SRR14563233_2.fastq.gz    wt_2
SAMN19231495    SRR14563234     ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/034/SRR14563234/SRR14563234_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/034/SRR14563234/SRR14563234_2.fastq.gz    wt_3
SAMN19231494    SRR14563235     ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/035/SRR14563235/SRR14563235_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/035/SRR14563235/SRR14563235_2.fastq.gz    wt_4
SAMN19231493    SRR14563236     ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/036/SRR14563236/SRR14563236_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/036/SRR14563236/SRR14563236_2.fastq.gz    SPP1 KO_1
SAMN19231492    SRR14563237     ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/037/SRR14563237/SRR14563237_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/037/SRR14563237/SRR14563237_2.fastq.gz    SPP1 KO_2
SAMN19231491    SRR14563238     ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/038/SRR14563238/SRR14563238_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/038/SRR14563238/SRR14563238_2.fastq.gz    SPP1 KO_3
SAMN19231490    SRR14563239     ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/039/SRR14563239/SRR14563239_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR145/039/SRR14563239/SRR14563239_2.fastq.gz    SPP1 KO_4

cut -f 2,3,4 PRJNA730426.tsv > fastq_links.txt

Adding ftp:// in front of each instance of ftp and removing the space after SPP1 in the sample_title column.

$ sed 's|ftp|ftp://ftp|g' fastq_links.txt >> ftp_links.txt

$ sed 's|SPP1 |SPP1_|g' ftp_links.txt >>  ftp_links1.txt

$ mv ftp_links1.txt ftp_links.txt

Downloading the .fasta file for mm10 genome

$ wget "ftp://ftp://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz"

Copying the annotation .gtf file from my desktop obtained from the UCSC genome table browser with the following options:

% scp /Users/mar/Downloads/mm10.ncbiRefSeq.gtf moa4020@aphrodite.med.cornell.edu:/athena/angsd/scratch/moa4020/project/referenceGenome/mm10

I have used the following script to download and rename the .fastq files with the sample name.

$ sbatch $MyScripts/downloading_fastq.sh

#!/bin/bash
#SBATCH --job-name=downloading_fastq
#SBATCH --output=/home/moa4020/angsd/project/scripts/stdout/downloading_fastq_%j.out
#SBATCH --error=/home/moa4020/angsd/project/scripts/stderr/downloading_fastq_%j.err
#SBATCH --mail-user=moa4020@med.cornell.edu
#SBATCH --mail-type=ALL
#SBATCH --mem=32G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=5

sed '1d' ftp_links.txt > links.txt

while read line; do
  links=$(echo "$line" | cut -f 2)
  url=$(echo "$links" | cut -d';' -f1)
  sample_name=$(echo "$line" | cut -f 3)
  echo "Downloading $url"
  wget "$url" -O "${sample_name}_1.fastq.gz"
done < links.txt

while read line; do
  links=$(echo "$line" | cut -f 2)
  url=$(echo "$links" | cut -d';' -f2)
  sample_name=$(echo "$line" | cut -f 3)
  echo "Downloading $url"
  wget "$url" -O "${sample_name}_2.fastq.gz"
done < links.txt

rm links.txt

Pre-alignment Basic QC

includeHTML("QC/pre_alignment_multiqc_report.html")
MultiQC Report

Highlight Samples

Regex mode off

    Rename Samples

    Click here for bulk input.

    Paste two columns of a tab-delimited table here (eg. from Excel).

    First column should be the old name, second column the new name.

    Regex mode off

      Show / Hide Samples

      Regex mode off

        Export Plots

        px
        px
        X

        Download the raw data used to create the plots in this report below:

        Note that additional data was saved in multiqc_data when this report was generated.


        Choose Plots

        If you use plots from MultiQC in a publication or presentation, please cite:

        MultiQC: Summarize analysis results for multiple tools and samples in a single report
        Philip Ewels, Måns Magnusson, Sverker Lundin and Max Käller
        Bioinformatics (2016)
        doi: 10.1093/bioinformatics/btw354
        PMID: 27312411

        Save Settings

        You can save the toolbox settings for this report to the browser.


        Load Settings

        Choose a saved report profile from the dropdown box below:

        Tool Citations

        Please remember to cite the tools that you use in your analysis.

        To help with this, you can download publication details of the tools mentioned in this report:

        About MultiQC

        This report was generated using MultiQC, version 1.14

        You can see a YouTube video describing how to use MultiQC reports here: https://youtu.be/qPbIlO_KWN0

        For more information about MultiQC, including other videos and extensive documentation, please visit http://multiqc.info

        You can report bugs, suggest improvements and find the source code for MultiQC on GitHub: https://github.com/ewels/MultiQC

        MultiQC is published in Bioinformatics:

        MultiQC: Summarize analysis results for multiple tools and samples in a single report
        Philip Ewels, Måns Magnusson, Sverker Lundin and Max Käller
        Bioinformatics (2016)
        doi: 10.1093/bioinformatics/btw354
        PMID: 27312411

        A modular tool to aggregate results from bioinformatics analyses across many samples into a single report.

        Report generated on 2023-03-24, 11:04 EDT based on data in: /athena/angsd/scratch/moa4020/project/GEO_Dataset/fastqc


        General Statistics

        Showing 16/16 rows and 3/6 columns.
        Sample Name% Dups% GCM Seqs
        SPP1_KO_1_1
        82.2%
        48%
        18.1
        SPP1_KO_1_2
        80.4%
        49%
        18.1
        SPP1_KO_2_1
        57.4%
        50%
        23.3
        SPP1_KO_2_2
        56.5%
        50%
        23.3
        SPP1_KO_3_1
        63.6%
        51%
        26.5
        SPP1_KO_3_2
        62.5%
        51%
        26.5
        SPP1_KO_4_1
        72.4%
        50%
        22.2
        SPP1_KO_4_2
        71.4%
        50%
        22.2
        wt_1_1
        56.2%
        50%
        26.6
        wt_1_2
        54.9%
        50%
        26.6
        wt_2_1
        59.7%
        50%
        30.3
        wt_2_2
        58.0%
        50%
        30.3
        wt_3_1
        54.5%
        50%
        23.4
        wt_3_2
        53.6%
        50%
        23.4
        wt_4_1
        58.3%
        50%
        28.1
        wt_4_2
        57.2%
        50%
        28.1

        FastQC

        FastQC is a quality control tool for high throughput sequence data, written by Simon Andrews at the Babraham Institute in Cambridge.

        Sequence Counts

        Sequence counts for each sample. Duplicate read counts are an estimate only.

        This plot show the total number of reads, broken down into unique and duplicate if possible (only more recent versions of FastQC give duplicate info).

        You can read more about duplicate calculation in the FastQC documentation. A small part has been copied here for convenience:

        Only sequences which first appear in the first 100,000 sequences in each file are analysed. This should be enough to get a good impression for the duplication levels in the whole file. Each sequence is tracked to the end of the file to give a representative count of the overall duplication level.

        The duplication detection requires an exact sequence match over the whole length of the sequence. Any reads over 75bp in length are truncated to 50bp for this analysis.

        loading..

        Sequence Quality Histograms

        The mean quality value across each base position in the read.

        To enable multiple samples to be plotted on the same graph, only the mean quality scores are plotted (unlike the box plots seen in FastQC reports).

        Taken from the FastQC help:

        The y-axis on the graph shows the quality scores. The higher the score, the better the base call. The background of the graph divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red). The quality of calls on most platforms will degrade as the run progresses, so it is common to see base calls falling into the orange area towards the end of a read.

        loading..

        Per Sequence Quality Scores

        The number of reads with average quality scores. Shows if a subset of reads has poor quality.

        From the FastQC help:

        The per sequence quality score report allows you to see if a subset of your sequences have universally low quality values. It is often the case that a subset of sequences will have universally poor quality, however these should represent only a small percentage of the total sequences.

        loading..

        Per Base Sequence Content

        The proportion of each base position for which each of the four normal DNA bases has been called.

        To enable multiple samples to be shown in a single plot, the base composition data is shown as a heatmap. The colours represent the balance between the four bases: an even distribution should give an even muddy brown colour. Hover over the plot to see the percentage of the four bases under the cursor.

        To see the data as a line plot, as in the original FastQC graph, click on a sample track.

        From the FastQC help:

        Per Base Sequence Content plots out the proportion of each base position in a file for which each of the four normal DNA bases has been called.

        In a random library you would expect that there would be little to no difference between the different bases of a sequence run, so the lines in this plot should run parallel with each other. The relative amount of each base should reflect the overall amount of these bases in your genome, but in any case they should not be hugely imbalanced from each other.

        It's worth noting that some types of library will always produce biased sequence composition, normally at the start of the read. Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichement of a number of different K-mers at the 5' end of the reads. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis.

        Click a sample row to see a line plot for that dataset.
        Rollover for sample name
        Position: -
        %T: -
        %C: -
        %A: -
        %G: -

        Per Sequence GC Content

        The average GC content of reads. Normal random library typically have a roughly normal distribution of GC content.

        From the FastQC help:

        This module measures the GC content across the whole length of each sequence in a file and compares it to a modelled normal distribution of GC content.

        In a normal random library you would expect to see a roughly normal distribution of GC content where the central peak corresponds to the overall GC content of the underlying genome. Since we don't know the the GC content of the genome the modal GC content is calculated from the observed data and used to build a reference distribution.

        An unusually shaped distribution could indicate a contaminated library or some other kinds of biased subset. A normal distribution which is shifted indicates some systematic bias which is independent of base position. If there is a systematic bias which creates a shifted normal distribution then this won't be flagged as an error by the module since it doesn't know what your genome's GC content should be.

        loading..

        Per Base N Content

        The percentage of base calls at each position for which an N was called.

        From the FastQC help:

        If a sequencer is unable to make a base call with sufficient confidence then it will normally substitute an N rather than a conventional base call. This graph shows the percentage of base calls at each position for which an N was called.

        It's not unusual to see a very low proportion of Ns appearing in a sequence, especially nearer the end of a sequence. However, if this proportion rises above a few percent it suggests that the analysis pipeline was unable to interpret the data well enough to make valid base calls.

        loading..

        Sequence Length Distribution

        All samples have sequences of a single length (150bp).

        Sequence Duplication Levels

        The relative level of duplication found for every sequence.

        From the FastQC Help:

        In a diverse library most sequences will occur only once in the final set. A low level of duplication may indicate a very high level of coverage of the target sequence, but a high level of duplication is more likely to indicate some kind of enrichment bias (eg PCR over amplification). This graph shows the degree of duplication for every sequence in a library: the relative number of sequences with different degrees of duplication.

        Only sequences which first appear in the first 100,000 sequences in each file are analysed. This should be enough to get a good impression for the duplication levels in the whole file. Each sequence is tracked to the end of the file to give a representative count of the overall duplication level.

        The duplication detection requires an exact sequence match over the whole length of the sequence. Any reads over 75bp in length are truncated to 50bp for this analysis.

        In a properly diverse library most sequences should fall into the far left of the plot in both the red and blue lines. A general level of enrichment, indicating broad oversequencing in the library will tend to flatten the lines, lowering the low end and generally raising other categories. More specific enrichments of subsets, or the presence of low complexity contaminants will tend to produce spikes towards the right of the plot.

        loading..

        Overrepresented sequences

        The total amount of overrepresented sequences found in each library.

        FastQC calculates and lists overrepresented sequences in FastQ files. It would not be possible to show this for all samples in a MultiQC report, so instead this plot shows the number of sequences categorized as over represented.

        Sometimes, a single sequence may account for a large number of reads in a dataset. To show this, the bars are split into two: the first shows the overrepresented reads that come from the single most common sequence. The second shows the total count from all remaining overrepresented sequences.

        From the FastQC Help:

        A normal high-throughput library will contain a diverse set of sequences, with no individual sequence making up a tiny fraction of the whole. Finding that a single sequence is very overrepresented in the set either means that it is highly biologically significant, or indicates that the library is contaminated, or not as diverse as you expected.

        FastQC lists all of the sequences which make up more than 0.1% of the total. To conserve memory only sequences which appear in the first 100,000 sequences are tracked to the end of the file. It is therefore possible that a sequence which is overrepresented but doesn't appear at the start of the file for some reason could be missed by this module.

        16 samples had less than 1% of reads made up of overrepresented sequences

        Adapter Content

        The cumulative percentage count of the proportion of your library which has seen each of the adapter sequences at each position.

        Note that only samples with ≥ 0.1% adapter contamination are shown.

        There may be several lines per sample, as one is shown for each adapter detected in the file.

        From the FastQC Help:

        The plot shows a cumulative percentage count of the proportion of your library which has seen each of the adapter sequences at each position. Once a sequence has been seen in a read it is counted as being present right through to the end of the read so the percentages you see will only increase as the read length goes on.

        loading..

        Status Checks

        Status for each FastQC section showing whether results seem entirely normal (green), slightly abnormal (orange) or very unusual (red).

        FastQC assigns a status for each section of the report. These give a quick evaluation of whether the results of the analysis seem entirely normal (green), slightly abnormal (orange) or very unusual (red).

        It is important to stress that although the analysis results appear to give a pass/fail result, these evaluations must be taken in the context of what you expect from your library. A 'normal' sample as far as FastQC is concerned is random and diverse. Some experiments may be expected to produce libraries which are biased in particular ways. You should treat the summary evaluations therefore as pointers to where you should concentrate your attention and understand why your library may not look random and diverse.

        Specific guidance on how to interpret the output of each module can be found in the relevant report section, or in the FastQC help.

        In this heatmap, we summarise all of these into a single heatmap for a quick overview. Note that not all FastQC sections have plots in MultiQC reports, but all status checks are shown in this heatmap.

        loading..

        Inference: it is safe to say that there are no confounding effects that can be attributed to one of the conditions from these results. Although the least amount and the highest amount of reads seem to come from one of the samples in SPP1_KO and wt respectively the sample size is low and the difference is not uniform across both the conditions to assume that this might cause a bias.

        STAR Indexing & Alignment

        I have used the following scripts to perform indexing and alignment:

        $ sbatch $MyScripts/indexing.sh

        #!/bin/bash -l
        #SBATCH --job-name=STAR_genomeGenerate
        #SBATCH --output=/home/moa4020/angsd/project/scripts/stdout/STAR_genomeGenerate_%j.out
        #SBATCH --error=/home/moa4020/angsd/project/scripts/stderr/STAR_genomeGenerate_%j.err
        #SBATCH --mail-user=moa4020@med.cornell.edu
        #SBATCH --mail-type=ALL
        #SBATCH --time=10:00:00
        #SBATCH --mem=40G
        #SBATCH --nodes=1
        #SBATCH --ntasks-per-node=1
        #SBATCH --cpus-per-task=10
        
        # activate your environment
        mamba activate angsd
        
        STAR --runMode genomeGenerate \
             --runThreadN 10 \
             --genomeDir /athena/angsd/scratch/moa4020/project/referenceGenome/GRCm38_STARindex \
             --genomeFastaFiles /athena/angsd/scratch/moa4020/project/referenceGenome/mm10/mm10.fa \
             --sjdbGTFfile /athena/angsd/scratch/moa4020/project/referenceGenome/mm10/mm10.ncbiRefSeq.gtf \
             --sjdbOverhang 149
             

        The parameters used in the alignment command are explained below:

        • --runMode genomeGenerate: This specifies the mode of the STAR program, which is genome generation in this case.

        • --runThreads: This sets the number of threads used for the genome generation process.

        • --genomeDir /athena/angsd/scratch/moa4020/project/referenceGenome/GRCm38_STARindex: This specifies the directory where the genome index will be saved.

        • --genomeFastaFiles /athena/angsd/scratch/moa4020/project/referenceGenome/mm10/mm10.fa: This specifies the path to the FASTA file of the genome reference sequence that will be used to generate the genome index.

        • --sjdbGTFfile /athena/angsd/scratch/moa4020/project/referenceGenome/mm10/mm10.ncbiRefSeq.gtf: This specifies the path to the GTF annotation file that will be used to guide the genome generation process.

        • --sjdbOverhang 149: This specifies the length of the genomic sequence flanking the splice junctions that will be used to generate the genome index.

        $ sbatch $MyScripts/STAR_alignment.sh

        #!/bin/bash -l
        #SBATCH --job-name=STAR_alignReads
        #SBATCH --output=/home/moa4020/angsd/project/scripts/stdout/STAR_alignReads_%j.out
        #SBATCH --error=/home/moa4020/angsd/project/scripts/stderr/STAR_alignReads_%j.err
        #SBATCH --mail-user=moa4020@med.cornell.edu
        #SBATCH --mail-type=ALL
        #SBATCH --time=10:00:00
        #SBATCH --mem=40G
        #SBATCH --nodes=1
        #SBATCH --ntasks-per-node=1
        #SBATCH --cpus-per-task=15
        
        # Activate your environment
        mamba activate angsd
        
        IndexDir=/athena/angsd/scratch/moa4020/project/referenceGenome/GRCm38_STARindex
        
        fastaDir=/athena/angsd/scratch/moa4020/project/GEO_Dataset/fastq
        
        outDir=/athena/angsd/scratch/moa4020/project/GEO_Dataset/STAR_alignments
        
        # Make a table with the sample name and paths to the 1st and 2nd mate
        
        # Set the name of the output file
        sample_table=sample_table.txt
        
        > $sample_table
        
        # Iterate over the fastq files in the directory
        for file in ${fastaDir}/*_1.fastq.gz; do
          file1=$(basename ${file})
          sample=$(basename ${file%_1.fastq.gz})
          file2=${sample}_2.fastq.gz
          echo -e "${sample}\t${file1}\t${file2}" >> $sample_table
        done
        
        while read -r sample mate1 mate2
        do
                echo "Processing sample: $sample"
                STAR --runMode alignReads \
                     --runThreadN 15 \
                     --genomeDir $IndexDir \
                     --readFilesIn $fastaDir/$mate1 $fastaDir/$mate2 \
                     --readFilesCommand zcat \
                     --outFileNamePrefix $outDir/"$sample." \
                     --outSAMtype BAM SortedByCoordinate \
                     --outSAMunmapped Within \
                     --outSAMattributes All
        
        done < $sample_table
        
        samtools index -M $outDir/*.out.bam

        The parameters used in the alignment command are explained below:

        • --runMode alignReads: This parameter specifies the mode in which STAR should run, which in this case is to align reads.

        • --runThreadN: This parameter specifies the number of threads that should be used for the alignment, which in this case is 10. Using multiple threads can speed up the alignment process.

        • --genomeDir: This parameter specifies the directory where the genome index files are located.

        • --readFilesIn: This parameter specifies the input FASTQ file that contains the reads to be aligned.

        • --readFilesCommand zcat: This parameter specifies the command that should be used to decompress the input FASTQ file. In this case, zcat is used to decompress a gzipped file.

        • --outFileNamePrefix: This parameter specifies the prefix to be used for the output files.

        • --outSAMtype BAM SortedByCoordinate: This parameter specifies the format of the output SAM/BAM file, which in this case is BAM, sorted by coordinate.

        • --outSAMunmapped Within: This parameter specifies that unmapped reads should be included in the output file, but marked as “within” the aligned reads.

        • --outSAMattributes All: This parameter specifies which attributes should be included in the output SAM/BAM file.

        Performing alignment once again using the novel splice junction data obtained from the first iteration of alignment to improve the accuracy of gene expression data.

        $ sbatch $MyScripts/STAR_2alignment.sh

        #!/bin/bash -l
        #SBATCH --job-name=STAR_alignReads_it2
        #SBATCH --output=/home/moa4020/angsd/project/scripts/stdout/STAR_alignReads_it2_%j.out
        #SBATCH --error=/home/moa4020/angsd/project/scripts/stderr/STAR_alignReads_it2_%j.err
        #SBATCH --mail-user=moa4020@med.cornell.edu
        #SBATCH --mail-type=ALL
        #SBATCH --time=10:00:00
        #SBATCH --mem=40G
        #SBATCH --nodes=1
        #SBATCH --ntasks-per-node=1
        #SBATCH --cpus-per-task=15
        
        # Activate your environment
        mamba activate angsd
        
        IndexDir=/athena/angsd/scratch/moa4020/project/referenceGenome/GRCm38_STARindex
        
        fastaDir=/athena/angsd/scratch/moa4020/project/GEO_Dataset/fastq
        
        it1Dir=athena/angsd/scratch/moa4020/project/GEO_Dataset/STAR_alignments
        
        outDir=/athena/angsd/scratch/moa4020/project/GEO_Dataset/STAR_alignments_it2
        
        # Make a table with the sample name and paths to the 1st and 2nd mate
        
        # Set the name of the output file
        sample_table=sample_table.txt
        
        > $sample_table
        
        # Iterate over the fastq files in the directory
        for file in ${fastaDir}/*_1.fastq.gz; do
          file1=$(basename ${file})
          sample=$(basename ${file%_1.fastq.gz})
          file2=${sample}_2.fastq.gz
          echo -e "${sample}\t${file1}\t${file2}" >> $sample_table
        done
        
        while read -r sample mate1 mate2
        do
                echo "Processing sample: $sample"
                STAR --runMode alignReads \
                     --runThreadN 15 \
                     --genomeDir $IndexDir \
                     --sjdbFileChrStartEnd $it1Dir/*.SJ.out.tab \
                     --readFilesIn $fastaDir/$mate1 $fastaDir/$mate2 \
                     --readFilesCommand zcat \
                     --outFileNamePrefix $outDir/"$sample.it2." \
                     --outSAMtype BAM SortedByCoordinate \
                     --outSAMunmapped Within \
                     --outSAMattributes All
        
        done < $sample_table
        
        samtools index -M $outDir/*it2.*.out.bam

        Pre-lim QC

        Running samtools flagstat and stats function on all the .bam files

        $ sbatch $MyScripts/flagstat.sh

        #!/bin/bash -l
        #SBATCH --job-name=STAR_flagstat
        #SBATCH --output=/home/moa4020/angsd/project/scripts/stdout/STAR_flagstat_%j.out
        #SBATCH --error=/home/moa4020/angsd/project/scripts/stderr/STAR_flagstat_%j.err
        #SBATCH --mail-user=moa4020@med.cornell.edu
        #SBATCH --mail-type=ALL
        # Activate your environment
        mamba activate angsd
        
        outDir=/athena/angsd/scratch/moa4020/project/GEO_Dataset/STAR_alignments_it2/alignment_it2_files
        
        # Iterate over the fastq files in the directory
        for file in /athena/angsd/scratch/moa4020/project/GEO_Dataset/STAR_alignments_it2/alignment_it2_files/*/*.out.bam; do
          filename=$(basename ${file})
          foldername=$(echo $filename | cut -d ".it2" -f 1)
          samtools flagstat ${file} > ${outDir}/${foldername}/${filename}.flagstat.txt
          samtools stats ${file} > ${outDir}/${foldername}/${filename}.stats.txt
          echo "$filename done"
        done

        Counting reads

        I have used the following script to run the featureCounts command -

        $ sbatch $MyScripts/featureCounts.sh:

        #!/bin/bash -l
        #SBATCH --job-name=featureCounts
        #SBATCH --output=/home/moa4020/angsd/project/scripts/stdout/featureCounts_%j.out
        #SBATCH --error=/home/moa4020/angsd/project/scripts/stderr/featureCounts_%j.err
        #SBATCH --mail-user=moa4020@med.cornell.edu
        #SBATCH --mail-type=ALL
        #SBATCH --time=10:00:00
        #SBATCH --mem=32G
        #SBATCH --nodes=1
        #SBATCH --ntasks-per-node=1
        #SBATCH --cpus-per-task=15
        
        # Activate your environment
        mamba activate angsd
        
        AnnotationFile=/athena/angsd/scratch/moa4020/project/referenceGenome/mm10/mm10.ncbiRefSeq.gtf
        
        bamDir=/athena/angsd/scratch/moa4020/project/GEO_Dataset/STAR_alignments_it2/alignment_it2_files/*
        
        outDir=/athena/angsd/scratch/moa4020/project/GEO_Dataset/STAR_alignments_it2/featureCounts
        
        featureCounts -p -g gene_id -t exon --countReadPairs -T 10 -a $AnnotationFile -o ${outDir}/Counts.txt ${bamDir}/*.it2.Aligned.sortedByCoord.out.bam

        The command is explained below:

        • -p: Indicates that the input BAM files are paired-end reads.

        • -g gene_id: Specifies the feature type to be counted.

        • -T 10: Sets the number of threads used by featureCounts to 10.

        • -a $AnnotationFile: Specifies the annotation file to be used.

        • -o counts.txt: Specifies the output file name.

        Generating a bar plot (using ggplot2) that displays the numbers of assigned and unassigned reads for the featureCounts run:

        mySummary <- read.table("STAR/featureCounts/Counts.txt.summary")
        colNames <- sapply(mySummary[1,], function(x) gsub(".it2.Aligned.sortedByCoord.out.bam", "", basename(x)))
        colnames(mySummary) <- colNames
        mySummary <- as.data.frame(mySummary[-1,])
        
        # Melt the data and convert the value column to numeric
        long_mySummary <- melt(setDT(mySummary), id.vars = "Status")
        long_mySummary[, value := as.numeric(value)]
        
        # Remove zero value rows
        long_mySummary <- filter(long_mySummary, value != 0)
        long_mySummary
        ##                      Status  variable    value
        ##  1:                Assigned SPP1_KO_1 15330724
        ##  2:     Unassigned_Unmapped SPP1_KO_1   991633
        ##  3: Unassigned_MultiMapping SPP1_KO_1  1151223
        ##  4:   Unassigned_NoFeatures SPP1_KO_1  1030523
        ##  5:    Unassigned_Ambiguity SPP1_KO_1   285949
        ##  6:                Assigned SPP1_KO_2 20682764
        ##  7:     Unassigned_Unmapped SPP1_KO_2   674449
        ##  8: Unassigned_MultiMapping SPP1_KO_2  1537804
        ##  9:   Unassigned_NoFeatures SPP1_KO_2   925508
        ## 10:    Unassigned_Ambiguity SPP1_KO_2   380962
        ## 11:                Assigned SPP1_KO_3 23315712
        ## 12:     Unassigned_Unmapped SPP1_KO_3   859049
        ## 13: Unassigned_MultiMapping SPP1_KO_3  1805626
        ## 14:   Unassigned_NoFeatures SPP1_KO_3  1109579
        ## 15:    Unassigned_Ambiguity SPP1_KO_3   451127
        ## 16:                Assigned SPP1_KO_4 19236624
        ## 17:     Unassigned_Unmapped SPP1_KO_4   859499
        ## 18: Unassigned_MultiMapping SPP1_KO_4  1497251
        ## 19:   Unassigned_NoFeatures SPP1_KO_4  1137605
        ## 20:    Unassigned_Ambiguity SPP1_KO_4   355307
        ## 21:                Assigned      wt_1 23547362
        ## 22:     Unassigned_Unmapped      wt_1   874932
        ## 23: Unassigned_MultiMapping      wt_1  1771815
        ## 24:   Unassigned_NoFeatures      wt_1   945999
        ## 25:    Unassigned_Ambiguity      wt_1   459132
        ## 26:                Assigned      wt_2 26726542
        ## 27:     Unassigned_Unmapped      wt_2  1043887
        ## 28: Unassigned_MultiMapping      wt_2  2087863
        ## 29:   Unassigned_NoFeatures      wt_2  1094040
        ## 30:    Unassigned_Ambiguity      wt_2   532271
        ## 31:                Assigned      wt_3 20706864
        ## 32:     Unassigned_Unmapped      wt_3   760604
        ## 33: Unassigned_MultiMapping      wt_3  1562752
        ## 34:   Unassigned_NoFeatures      wt_3   888762
        ## 35:    Unassigned_Ambiguity      wt_3   413805
        ## 36:                Assigned      wt_4 24887319
        ## 37:     Unassigned_Unmapped      wt_4   871030
        ## 38: Unassigned_MultiMapping      wt_4  1901041
        ## 39:   Unassigned_NoFeatures      wt_4  1071408
        ## 40:    Unassigned_Ambiguity      wt_4   491248
        ##                      Status  variable    value
        ggplot(data = long_mySummary, aes(x = value, y = variable, fill = Status)) + geom_bar(stat = "identity", position = 'dodge') + labs(x = "# reads", y = "samples") +
          ggtitle("Counts.txt.summary")

        Inference: The total percentages of assigned reads for all the samples are comparable. The alignment appears to be successful.

        Running RSeQC

        I have downloaded the housekeeping .bed file for mm10 from this link and imported it to my work-space using the scp command.

        I have used the following scripts to run rseqc:

        $ sbatch $MyScripts/genebody_coverage.sh

        #!/bin/bash -l
        #SBATCH --job-name=genebody_coverage
        #SBATCH --output=/home/moa4020/angsd/project/scripts/stdout/genebody_coverage_%j.out
        #SBATCH --error=/home/moa4020/angsd/project/scripts/stderr/genebody_coverage_%j.err
        #SBATCH --mail-user=moa4020@med.cornell.edu
        #SBATCH --mail-type=ALL
        #SBATCH --time=90:00:00
        #SBATCH --mem=40G
        #SBATCH --nodes=1
        #SBATCH --ntasks-per-node=1
        #SBATCH --cpus-per-task=15
        
        mamba activate rseqc
        
        bamDir=/athena/angsd/scratch/moa4020/project/GEO_Dataset/STAR_alignments_it2/alignment_it2_files/*
        
        bedFile=/athena/angsd/scratch/moa4020/project/referenceGenome/mm10/mm10.HouseKeepingGenes.bed
        
        outDir=/athena/angsd/scratch/moa4020/project/GEO_Dataset/STAR_alignments_it2/rseqc
        
        # Iterate over the bam files in the directory
        for file in ${bamDir}/*.it2.Aligned.sortedByCoord.out.bam; do
                filename=$(basename ${file})
                #read_distribution.py -i $file -r $bedFile > ${outDir}/${filename}rseqc_read_distribution.out
                geneBody_coverage.py -i $file -r $bedFile -o ${outDir}/${filename}rseqc_geneBodyCoverage.txt
        done

        $ sbatch $MyScripts/read_distribution.sh

        #!/bin/bash -l
        #SBATCH --job-name=read_distribution
        #SBATCH --output=/home/moa4020/angsd/project/scripts/stdout/read_distribution_%j.out
        #SBATCH --error=/home/moa4020/angsd/project/scripts/stderr/read_distribution_%j.err
        #SBATCH --mail-user=moa4020@med.cornell.edu
        #SBATCH --mail-type=ALL
        #SBATCH --time=10:00:00
        #SBATCH --mem=40G
        #SBATCH --nodes=1
        #SBATCH --ntasks-per-node=1
        #SBATCH --cpus-per-task=15
        
        mamba activate rseqc
        
        bamDir=/athena/angsd/scratch/moa4020/project/GEO_Dataset/STAR_alignments_it2/alignment_it2_files/*
        
        bedFile=/athena/angsd/scratch/moa4020/project/referenceGenome/mm10/mm10.ncbiRefSeq.bed
        
        outDir=/athena/angsd/scratch/moa4020/project/GEO_Dataset/STAR_alignments_it2/rseqc
        
        # Iterate over the bam files in the directory
        for file in ${bamDir}/*.it2.Aligned.sortedByCoord.out.bam; do
                filename=$(basename ${file})
                read_distribution.py -i $file -r $bedFile > ${outDir}/${filename}rseqc_read_distribution.out
                #geneBody_coverage.py -i $file -r $bedFile -o ${outDir}/${filename}rseqc_geneBodyCoverage.txt
        done

        Running multiqc post-alignment and RSeQC:

        mamba activate multiqc
        multiqc .

        STAR Post-alignment MultiQC

        includeHTML("STAR_multiqc_report.html")
        MultiQC Report

        Highlight Samples

        Regex mode off

          Rename Samples

          Click here for bulk input.

          Paste two columns of a tab-delimited table here (eg. from Excel).

          First column should be the old name, second column the new name.

          Regex mode off

            Show / Hide Samples

            Regex mode off

              Export Plots

              px
              px
              X

              Download the raw data used to create the plots in this report below:

              Note that additional data was saved in multiqc_data when this report was generated.


              Choose Plots

              If you use plots from MultiQC in a publication or presentation, please cite:

              MultiQC: Summarize analysis results for multiple tools and samples in a single report
              Philip Ewels, Måns Magnusson, Sverker Lundin and Max Käller
              Bioinformatics (2016)
              doi: 10.1093/bioinformatics/btw354
              PMID: 27312411

              Save Settings

              You can save the toolbox settings for this report to the browser.


              Load Settings

              Choose a saved report profile from the dropdown box below:

              Tool Citations

              Please remember to cite the tools that you use in your analysis.

              To help with this, you can download publication details of the tools mentioned in this report:

              About MultiQC

              This report was generated using MultiQC, version 1.14

              You can see a YouTube video describing how to use MultiQC reports here: https://youtu.be/qPbIlO_KWN0

              For more information about MultiQC, including other videos and extensive documentation, please visit http://multiqc.info

              You can report bugs, suggest improvements and find the source code for MultiQC on GitHub: https://github.com/ewels/MultiQC

              MultiQC is published in Bioinformatics:

              MultiQC: Summarize analysis results for multiple tools and samples in a single report
              Philip Ewels, Måns Magnusson, Sverker Lundin and Max Käller
              Bioinformatics (2016)
              doi: 10.1093/bioinformatics/btw354
              PMID: 27312411

              A modular tool to aggregate results from bioinformatics analyses across many samples into a single report.

              Report generated on 2023-04-19, 19:57 EDT based on data in: /athena/angsd/scratch/moa4020/project/GEO_Dataset/STAR_alignments_it2


              General Statistics

              Showing 8/8 rows and 11/14 columns.
              Sample Name% AssignedM AssignedError rateM Non-PrimaryM Reads Mapped% Mapped% Proper PairsM Total seqsM Reads Mapped% AlignedM Aligned
              SPP1_KO_1.it2
              81.6%
              15.3
              0.26%
              1.3
              34.3
              94.5%
              94.5%
              36.3
              35.6
              91.8%
              16.6
              SPP1_KO_2.it2
              85.5%
              20.7
              0.24%
              1.7
              45.3
              97.1%
              97.1%
              46.7
              47.1
              94.3%
              22.0
              SPP1_KO_3.it2
              84.7%
              23.3
              0.26%
              2.0
              51.3
              96.8%
              96.8%
              53.0
              53.4
              93.8%
              24.9
              SPP1_KO_4.it2
              83.3%
              19.2
              0.23%
              1.7
              42.8
              96.1%
              96.1%
              44.5
              44.5
              93.2%
              20.7
              wt_1.it2
              85.3%
              23.5
              0.25%
              2.0
              51.4
              96.7%
              96.7%
              53.2
              53.4
              93.8%
              25.0
              wt_2.it2
              84.9%
              26.7
              0.27%
              2.4
              58.5
              96.6%
              96.6%
              60.6
              60.9
              93.6%
              28.4
              wt_3.it2
              85.1%
              20.7
              0.25%
              1.8
              45.4
              96.8%
              96.8%
              46.9
              47.1
              93.9%
              22.0
              wt_4.it2
              85.2%
              24.9
              0.25%
              2.2
              54.5
              96.9%
              96.9%
              56.3
              56.7
              94.0%
              26.4

              RSeQC

              RSeQC package provides a number of useful modules that can comprehensively evaluate high throughput RNA-seq data.DOI: 10.1093/bioinformatics/bts356.

              Read Distribution

              Read Distribution calculates how mapped reads are distributed over genome features.

              loading..

              Gene Body Coverage

              Gene Body Coverage calculates read coverage over gene bodies. This is used to check if reads coverage is uniform and if there is any 5' or 3' bias.

              loading..

              featureCounts

              Subread featureCounts is a highly efficient general-purpose read summarization program that counts mapped reads for genomic features such as genes, exons, promoter, gene bodies, genomic bins and chromosomal locations.DOI: 10.1093/bioinformatics/btt656.

              loading..

              Samtools

              Samtools is a suite of programs for interacting with high-throughput sequencing data.DOI: 10.1093/bioinformatics/btp352.

              Percent Mapped

              Alignment metrics from samtools stats; mapped vs. unmapped reads.

              For a set of samples that have come from the same multiplexed library, similar numbers of reads for each sample are expected. Large differences in numbers might indicate issues during the library preparation process. Whilst large differences in read numbers may be controlled for in downstream processings (e.g. read count normalisation), you may wish to consider whether the read depths achieved have fallen below recommended levels depending on the applications.

              Low alignment rates could indicate contamination of samples (e.g. adapter sequences), low sequencing quality or other artefacts. These can be further investigated in the sequence level QC (e.g. from FastQC).

              loading..

              Alignment metrics

              This module parses the output from samtools stats. All numbers in millions.

              loading..

              Samtools Flagstat

              This module parses the output from samtools flagstat. All numbers in millions.

              loading..

              STAR

              STAR is an ultrafast universal RNA-seq aligner.DOI: 10.1093/bioinformatics/bts635.

              Alignment Scores

              loading..

              Inference: An average of 45-50 million transcripts were assigned to the exonic regions of the genome. The gene body coverage for all the samples appears to have a uniform trend. The alignment was successful and no bias was observed towards one condition.

              Downloading and Installing Salmon

              $ conda config --add channels conda-forge
              $ conda config --add channels bioconda
              $ conda create -n salmon salmon
              # Activating salmon
              $ conda activate salmon

              Building the reference transcriptome and genome for salmon indexing

              Converting a .gtf file to a txdb object to produce a transcriptome file using the following code:

              #' Export Transcriptome as FASTA
              #'
              #' @param txdb a \code{TxDb} object representing a transcriptome annotation
              #' @param genome a \code{BSgenome} object from which to extract sequences
              #' @param file a string for output FASTA file. File names ending in "**.gz**"
              #'     will automatically use gzip compression.
              #' @param ... additional arguments to pass through to
              #'     \code{\link[Biostrings]{writeXStringSet}}
              #' @return The \code{txdb} argument is invisibly returned.
              #'
              #' @examples
              #' library(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene)
              library(BSgenome.Mmusculus.UCSC.mm10)
              #'
              #' ## load annotation and genome
              txdb <- GenomicFeatures::makeTxDbFromGFF('Downloaded_Data/mm10.ncbiRefSeq.gtf')
              mm10 <- BSgenome.Mmusculus.UCSC.mm10
              #'
              #' ## restrict to 'chrI' transcripts (makes for briefer example runtime)
              #' seqlevels(txdb) <- c("chrI")
              #'
              #' ## last 500 nts per tx
              #' txdb_w500 <- truncateTxome(txdb)
              #'
              #' ## export uncompressed
              #' outfile <- tempfile("sacCer3.sgdGene.w500", fileext=".fa")
              #' exportFASTA(txdb_w500, sacCer3, outfile)
              #'
              #' ## export compressed
              #' outfile <- tempfile("sacCer3.sgdGene.w500", fileext=".fa.gz")
              #' exportFASTA(txdb_w500, sacCer3, outfile)
              #'
              #' @importFrom GenomicFeatures extractTranscriptSeqs
              #' @importFrom Biostrings writeXStringSet
              #' @export
              exportFASTA <- function (txdb, genome, file, ...) {
                seqs <- extractTranscriptSeqs(genome, txdb, use.names=TRUE)
                compress <- grepl(".gz$", file)
                writeXStringSet(seqs, file, format="fasta", compress=compress, ...)
                invisible(txdb)
              }
              
              exportFASTA(txdb,mm10,"mm10.transcriptome.fa.gz")

              Preparing metadata

              Salmon indexing requires the names of the genome targets, which is extractable by using the grep command:

              $ grep "^>" <(gunzip -c /athena/angsd/scratch/moa4020/project/referenceGenome/mm10/mm10.fa.gz) | cut -d " " -f 1 > /athena/angsd/scratch/moa4020/project/referenceGenome/mm10/decoys.txt
              $ sed -i.bak -e 's/>//g' /athena/angsd/scratch/moa4020/project/referenceGenome/mm10/decoys.txt

              Along with the list of decoys salmon also needs the concatenated transcriptome and genome reference file for index. NOTE: the genome targets (decoys) should come after the transcriptome targets in the reference

              $ cat $MyProject/referenceGenome/mm10/mm10.transcriptome.fa.gz $MyProject/referenceGenome/mm10/mm10.fa.gz > $MyProject/referenceGenome/mm10/mm10_gentrome.fa.gz

              Salmon Indexing

              $ sbatch $MyScripts/salmon_indexing.sh

              #!/bin/bash -l
              #SBATCH --job-name=salmon_indexing
              #SBATCH --output=/home/moa4020/angsd/project/scripts/stdout/salmon_indexing_%j.out
              #SBATCH --error=/home/moa4020/angsd/project/scripts/stderr/salmon_indexing_%j.err
              #SBATCH --mail-user=moa4020@med.cornell.edu
              #SBATCH --mail-type=ALL
              #SBATCH --time=10:00:00
              #SBATCH --mem=40G
              #SBATCH --nodes=1
              #SBATCH --ntasks-per-node=1
              #SBATCH --cpus-per-task=15
              
              mamba activate salmon
              
              refDir=/athena/angsd/scratch/moa4020/project/referenceGenome/mm10
              outDir=/athena/angsd/scratch/moa4020/project/referenceGenome/Salmonindex
              
              salmon index -t $refDir/mm10_gentrome.fa.gz -d $refDir/decoys.txt -p 15 -i $outDir

              Salmon Mapping

              $ sbatch $MyScripts/salmon_mapping.sh

              #!/bin/bash -l
              #SBATCH --job-name=salmon_mapping
              #SBATCH --output=/home/moa4020/angsd/project/scripts/stdout/salmon_mapping%j.out
              #SBATCH --error=/home/moa4020/angsd/project/scripts/stderr/salmon_mapping%j.err
              #SBATCH --mail-user=moa4020@med.cornell.edu
              #SBATCH --mail-type=ALL
              #SBATCH --time=10:00:00
              #SBATCH --mem=40G
              #SBATCH --nodes=1
              #SBATCH --ntasks-per-node=1
              #SBATCH --cpus-per-task=15
              
              mamba activate salmon
              
              indexDir=/athena/angsd/scratch/moa4020/project/referenceGenome/Salmonindex
              
              fastqDir=/athena/angsd/scratch/moa4020/project/GEO_Dataset/fastq
              
              outDir=/athena/angsd/scratch/moa4020/project/GEO_Dataset/salmon_alignments
              
              gtfFile=/athena/angsd/scratch/moa4020/project/referenceGenome/mm10/mm10.ncbiRefSeq.gtf
              
              # Make a table with the sample name and paths to the 1st and 2nd mate
              
              # Set the name of the output file
              sample_table=sample_table.txt
              
              > $sample_table
              
              # Iterate over the fastq files in the directory
              for file in ${fastqDir}/*_1.fastq.gz; do
                file1=$(basename ${file})
                sample=$(basename ${file%_1.fastq.gz})
                file2=${sample}_2.fastq.gz
                echo -e "${sample}\t${file1}\t${file2}" >> $sample_table
                done
              
              while read -r sample mate1 mate2
              do
                    echo "Processing sample: $sample"
                    salmon quant -g $gtfFile -i $indexDir -l A -1 $fastqDir/$mate1 -2 $fastqDir/$mate2 -p 15  --validateMappings --writeUnmappedNames -o $outDir/"$sample"
              
              done < $sample_table

              Salmon Post-Alignment Multiqc

              MultiQC Report

              Loading report..

              Highlight Samples

              Regex mode off

                Rename Samples

                Click here for bulk input.

                Paste two columns of a tab-delimited table here (eg. from Excel).

                First column should be the old name, second column the new name.

                Regex mode off

                  Show / Hide Samples

                  Regex mode off

                    Export Plots

                    px
                    px
                    X

                    Download the raw data used to create the plots in this report below:

                    Note that additional data was saved in multiqc_data when this report was generated.


                    Choose Plots

                    If you use plots from MultiQC in a publication or presentation, please cite:

                    MultiQC: Summarize analysis results for multiple tools and samples in a single report
                    Philip Ewels, Måns Magnusson, Sverker Lundin and Max Käller
                    Bioinformatics (2016)
                    doi: 10.1093/bioinformatics/btw354
                    PMID: 27312411

                    Save Settings

                    You can save the toolbox settings for this report to the browser.


                    Load Settings

                    Choose a saved report profile from the dropdown box below:

                    Tool Citations

                    Please remember to cite the tools that you use in your analysis.

                    To help with this, you can download publication details of the tools mentioned in this report:

                    About MultiQC

                    This report was generated using MultiQC, version 1.14

                    You can see a YouTube video describing how to use MultiQC reports here: https://youtu.be/qPbIlO_KWN0

                    For more information about MultiQC, including other videos and extensive documentation, please visit http://multiqc.info

                    You can report bugs, suggest improvements and find the source code for MultiQC on GitHub: https://github.com/ewels/MultiQC

                    MultiQC is published in Bioinformatics:

                    MultiQC: Summarize analysis results for multiple tools and samples in a single report
                    Philip Ewels, Måns Magnusson, Sverker Lundin and Max Käller
                    Bioinformatics (2016)
                    doi: 10.1093/bioinformatics/btw354
                    PMID: 27312411

                    A modular tool to aggregate results from bioinformatics analyses across many samples into a single report.

                    Report generated on 2023-04-20, 02:49 EDT based on data in: /athena/angsd/scratch/moa4020/project/GEO_Dataset/salmon_alignments


                    General Statistics

                    Showing 8/8 rows and 2/2 columns.
                    Sample Name% AlignedM Aligned
                    SPP1_KO_1
                    88.0%
                    15.9
                    SPP1_KO_2
                    91.7%
                    21.4
                    SPP1_KO_3
                    91.0%
                    24.1
                    SPP1_KO_4
                    89.8%
                    20.0
                    wt_1
                    91.8%
                    24.4
                    wt_2
                    91.9%
                    27.9
                    wt_3
                    91.7%
                    21.5
                    wt_4
                    91.8%
                    25.8

                    Salmon

                    Salmon is a tool for quantifying the expression of transcripts using RNA-seq data.DOI: 10.1038/nmeth.4197.

                    loading..

                    Inference: from the above QC post salmon mapping, the percent of aligned reads indicates a successful feature counting run.

                    DESeq

                    FILE_DSD="DESeqReadyFiles.RData"
                    load(FILE_DSD)
                    
                    DESeq.ds <- DESeq(DESeq.ds)
                    
                    salmon_DESeq.ds <- DESeq(salmon_DESeq.ds)
                    rlog.dge <- DESeq.rlog[DGEgenes,] %>% assay
                    salmon_rlog.dge <- salmon_DESeq.rlog[salmon_DGEgenes,] %>% assay
                    # Sanity check to see if both the conditions actually lead to 2 different clusters
                    DESeq2::plotPCA(DESeq.rlog, intgroup = "condition", ntop = 500, returnData = FALSE)

                    DESeq2::plotPCA(salmon_DESeq.rlog, intgroup = "condition", ntop = 500, returnData = FALSE)

                    pheatmap(rlog.dge, scale="row",
                             show_rownames=FALSE, main="STAR DGE (row-based z-score)")

                    pheatmap(salmon_rlog.dge, scale="row",
                             show_rownames=FALSE, main="Salmon DGE (row-based z-score)")

                    STAR_vp <- EnhancedVolcano(DGE.results.shrnk, lab=rownames(DGE.results.shrnk),
                                           x='log2FoldChange', y='padj', pCutoff = 0.05,
                                           title="STAR (with logFC shrinkage)")
                    ## Warning: One or more p-values is 0. Converting to 10^-1 * current lowest
                    ## non-zero p-value...
                    Salmon_vp <- EnhancedVolcano(salmon_DGE.results.shrnk, lab=rownames(salmon_DGE.results.shrnk),
                                           x='log2FoldChange', y='padj', pCutoff = 0.05,
                                           title="Salmon (with logFC shrinkage)")
                    ## Warning: One or more p-values is 0. Converting to 10^-1 * current lowest
                    ## non-zero p-value...
                    STAR_vp + Salmon_vp

                    These volcano plots confer with the volcano plot produced in the literature for this dataset.

                    DGE.results %>% `[`(order(.$padj),) %>% head
                    ## log2 fold change (MLE): condition SPP1 KO vs wt 
                    ## Wald test p-value: condition SPP1 KO vs wt 
                    ## DataFrame with 6 rows and 6 columns
                    ##          baseMean log2FoldChange     lfcSE      stat       pvalue         padj
                    ##         <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
                    ## Spp1    14917.826       -7.11281 0.1611036  -44.1505  0.00000e+00  0.00000e+00
                    ## Mrc1     1130.756        4.33284 0.1201555   36.0603 9.52727e-285 9.76164e-281
                    ## Cd93     1779.027        4.93173 0.1484807   33.2146 6.61766e-242 4.52031e-238
                    ## Xdh       967.098        3.70743 0.1148878   32.2700 1.84404e-228 9.44704e-225
                    ## Pld4     1303.429        3.32855 0.1110480   29.9739 2.14610e-197 8.79558e-194
                    ## Pik3ap1  1679.039        2.76266 0.0937521   29.4677 7.46436e-191 2.54933e-187
                    salmon_DGE.results %>% `[`(order(.$padj),) %>% head
                    ## log2 fold change (MLE): condition SPP1 KO vs wt 
                    ## Wald test p-value: condition SPP1 KO vs wt 
                    ## DataFrame with 6 rows and 6 columns
                    ##          baseMean log2FoldChange     lfcSE      stat       pvalue         padj
                    ##         <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
                    ## Mrc1      1134.00        4.37009 0.1158709   37.7152  0.00000e+00  0.00000e+00
                    ## Spp1     15131.58       -6.29562 0.1065286  -59.0979  0.00000e+00  0.00000e+00
                    ## Cd93      1760.21        4.95606 0.1494439   33.1634 3.63555e-241 2.43679e-237
                    ## Xdh        965.02        3.70092 0.1128660   32.7904 8.06465e-236 4.05410e-232
                    ## Pik3ap1   1723.48        2.77874 0.0907901   30.6062 1.01213e-205 4.07037e-202
                    ## Pld4      1274.49        3.34293 0.1111628   30.0724 1.11255e-198 3.72854e-195
                    DGE.genes <- subset(DGE.results, log2FoldChange > 1.5 &  padj < 0.05)
                    
                    DGE.genes  %>% `[`(order(.$padj),) %>% head
                    ## log2 fold change (MLE): condition SPP1 KO vs wt 
                    ## Wald test p-value: condition SPP1 KO vs wt 
                    ## DataFrame with 6 rows and 6 columns
                    ##          baseMean log2FoldChange     lfcSE      stat       pvalue         padj
                    ##         <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
                    ## Mrc1     1130.756        4.33284 0.1201555   36.0603 9.52727e-285 9.76164e-281
                    ## Cd93     1779.027        4.93173 0.1484807   33.2146 6.61766e-242 4.52031e-238
                    ## Xdh       967.098        3.70743 0.1148878   32.2700 1.84404e-228 9.44704e-225
                    ## Pld4     1303.429        3.32855 0.1110480   29.9739 2.14610e-197 8.79558e-194
                    ## Pik3ap1  1679.039        2.76266 0.0937521   29.4677 7.46436e-191 2.54933e-187
                    ## Igf2    14416.806        3.63074 0.1268241   28.6282 2.99615e-180 8.77103e-177
                    salmon_DGE.genes <- subset(salmon_DGE.results,log2FoldChange > 1.5 & padj < 0.05)
                    
                    salmon_DGE.genes  %>% `[`(order(.$padj),) %>% head
                    ## log2 fold change (MLE): condition SPP1 KO vs wt 
                    ## Wald test p-value: condition SPP1 KO vs wt 
                    ## DataFrame with 6 rows and 6 columns
                    ##          baseMean log2FoldChange     lfcSE      stat       pvalue         padj
                    ##         <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
                    ## Mrc1      1134.00        4.37009 0.1158709   37.7152  0.00000e+00  0.00000e+00
                    ## Cd93      1760.21        4.95606 0.1494439   33.1634 3.63555e-241 2.43679e-237
                    ## Xdh        965.02        3.70092 0.1128660   32.7904 8.06465e-236 4.05410e-232
                    ## Pik3ap1   1723.48        2.77874 0.0907901   30.6062 1.01213e-205 4.07037e-202
                    ## Pld4      1274.49        3.34293 0.1111628   30.0724 1.11255e-198 3.72854e-195
                    ## Igf2     14474.60        3.63923 0.1248785   29.1422 1.04955e-186 3.01490e-183

                    KEGG Pathway Enrichment

                    orgdb <- org.Mm.eg.db
                    
                    txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene
                    
                    keytypes(txdb)
                    ## [1] "CDSID"    "CDSNAME"  "EXONID"   "EXONNAME" "GENEID"   "TXID"     "TXNAME"
                    gene.vector <- AnnotationDbi::select(orgdb, keys=row.names(DGE.genes), columns= c('ENTREZID','ENSEMBL'), keytype="SYMBOL", multiVals="first")
                    ## 'select()' returned 1:many mapping between keys and columns
                    head(gene.vector)
                    ##    SYMBOL ENTREZID            ENSEMBL
                    ## 1   Padi2    18600 ENSMUSG00000028927
                    ## 2 Olfr111     <NA>               <NA>
                    ## 3  Olfr99     <NA>               <NA>
                    ## 4    Lst1    16988 ENSMUSG00000073412
                    ## 5 Ncr3-ps   667612 ENSMUSG00000086076
                    ## 6    Aif1    11629 ENSMUSG00000024397
                    STAR_Enrichment <- enrichKEGG(
                      gene.vector[,2],
                      organism = "mmu",
                      keyType = "kegg",
                      pvalueCutoff = 0.05,
                      pAdjustMethod = "BH",
                      minGSSize = 10,
                      maxGSSize = 500,
                      qvalueCutoff = 0.2,
                      use_internal_data = FALSE
                    )
                    salmon_gene.vector <- AnnotationDbi::select(orgdb, keys=row.names(salmon_DGE.genes), columns= c('ENTREZID','ENSEMBL'), keytype="SYMBOL")
                    ## 'select()' returned 1:many mapping between keys and columns
                    head(salmon_gene.vector)
                    ##         SYMBOL ENTREZID            ENSEMBL
                    ## 1 LOC115489325     <NA>               <NA>
                    ## 2          Uty    22290 ENSMUSG00000068457
                    ## 3      Eif2s3y    26908 ENSMUSG00000069049
                    ## 4        Kdm5d    20592 ENSMUSG00000056673
                    ## 5         Tlr8   170744 ENSMUSG00000040522
                    ## 6        Reps2   194590 ENSMUSG00000040855
                    salmon_Enrichment <- enrichKEGG(
                      salmon_gene.vector[,2],
                      organism = "mmu",
                      keyType = "kegg",
                      pvalueCutoff = 0.05,
                      pAdjustMethod = "BH",
                      qvalueCutoff = 0.2,
                      minGSSize = 10,
                      maxGSSize = 500,
                      use_internal_data = FALSE
                    )
                    #autophagy is in the list
                    dotplot(STAR_Enrichment)

                    dotplot(salmon_Enrichment)

                    STAR_Enrichment@result[["Description"]][1:20]
                    ##  [1] "Complement and coagulation cascades - Mus musculus (house mouse)"                          
                    ##  [2] "Cytokine-cytokine receptor interaction - Mus musculus (house mouse)"                       
                    ##  [3] "Osteoclast differentiation - Mus musculus (house mouse)"                                   
                    ##  [4] "B cell receptor signaling pathway - Mus musculus (house mouse)"                            
                    ##  [5] "Viral protein interaction with cytokine and cytokine receptor - Mus musculus (house mouse)"
                    ##  [6] "Tuberculosis - Mus musculus (house mouse)"                                                 
                    ##  [7] "Pertussis - Mus musculus (house mouse)"                                                    
                    ##  [8] "Chemokine signaling pathway - Mus musculus (house mouse)"                                  
                    ##  [9] "Hematopoietic cell lineage - Mus musculus (house mouse)"                                   
                    ## [10] "NF-kappa B signaling pathway - Mus musculus (house mouse)"                                 
                    ## [11] "Staphylococcus aureus infection - Mus musculus (house mouse)"                              
                    ## [12] "NOD-like receptor signaling pathway - Mus musculus (house mouse)"                          
                    ## [13] "Leishmaniasis - Mus musculus (house mouse)"                                                
                    ## [14] "Coronavirus disease - COVID-19 - Mus musculus (house mouse)"                               
                    ## [15] "Platelet activation - Mus musculus (house mouse)"                                          
                    ## [16] "Neutrophil extracellular trap formation - Mus musculus (house mouse)"                      
                    ## [17] "Natural killer cell mediated cytotoxicity - Mus musculus (house mouse)"                    
                    ## [18] "Lipid and atherosclerosis - Mus musculus (house mouse)"                                    
                    ## [19] "Fc gamma R-mediated phagocytosis - Mus musculus (house mouse)"                             
                    ## [20] "Leukocyte transendothelial migration - Mus musculus (house mouse)"
                    salmon_Enrichment@result[["Description"]][1:20]
                    ##  [1] "Complement and coagulation cascades - Mus musculus (house mouse)"                          
                    ##  [2] "Osteoclast differentiation - Mus musculus (house mouse)"                                   
                    ##  [3] "B cell receptor signaling pathway - Mus musculus (house mouse)"                            
                    ##  [4] "Cytokine-cytokine receptor interaction - Mus musculus (house mouse)"                       
                    ##  [5] "Viral protein interaction with cytokine and cytokine receptor - Mus musculus (house mouse)"
                    ##  [6] "Tuberculosis - Mus musculus (house mouse)"                                                 
                    ##  [7] "Chemokine signaling pathway - Mus musculus (house mouse)"                                  
                    ##  [8] "NF-kappa B signaling pathway - Mus musculus (house mouse)"                                 
                    ##  [9] "Pertussis - Mus musculus (house mouse)"                                                    
                    ## [10] "Staphylococcus aureus infection - Mus musculus (house mouse)"                              
                    ## [11] "NOD-like receptor signaling pathway - Mus musculus (house mouse)"                          
                    ## [12] "Coronavirus disease - COVID-19 - Mus musculus (house mouse)"                               
                    ## [13] "Leishmaniasis - Mus musculus (house mouse)"                                                
                    ## [14] "Neutrophil extracellular trap formation - Mus musculus (house mouse)"                      
                    ## [15] "Hematopoietic cell lineage - Mus musculus (house mouse)"                                   
                    ## [16] "Natural killer cell mediated cytotoxicity - Mus musculus (house mouse)"                    
                    ## [17] "Lipid and atherosclerosis - Mus musculus (house mouse)"                                    
                    ## [18] "Platelet activation - Mus musculus (house mouse)"                                          
                    ## [19] "Fc gamma R-mediated phagocytosis - Mus musculus (house mouse)"                             
                    ## [20] "Leukocyte transendothelial migration - Mus musculus (house mouse)"
                    # create an empty vector to store the values
                    values <- numeric(333)
                    
                    # loop over all values of x from 1 to 333
                    for (x in 1:333) {
                      # calculate the value of the equation for the current value of x
                      value <- length(intersect(salmon_Enrichment@result[["Description"]][1:x],STAR_Enrichment@result[["Description"]][1:x]))
                      # store the value in the vector
                      values[x] <- value/x
                    }
                    
                    plot(values, type = "l", xlab = "x", ylab = "Identity")

                    plot(values[1:40], type = "l", xlab = "x", ylab = "Identity")

                    plot(values[1:10], type = "l", xlab = "x", ylab = "Identity")

                    List of genes mentioned to be differentially expressed

                    Pan Reactive A1-Specific A2-Specific
                    Lcn2 H2-T23 Clcf1
                    Steap4 Serping1 Tgm1
                    S1pr3 H2-D1 Ptx3
                    Timp1 Ggta1 S100a10
                    Hspb1 Iigp1 Sphk1
                    Cxcl10 Gbp2 Cd109
                    Cd44 Fbln5 Ptgs2
                    Osmr Ugt1a1 Emp1
                    Cp Fkbp5 Tm4sf1
                    Serpina3n Psmb8 Cd14
                    Aspg Srgn B3gnt5
                    Vim Amigo2
                    Gfap
                    DEGs <- read.csv(file="DEG_Paper.csv", header = TRUE)
                    PanReactive <- DEGs[,1]
                    PanCounts <- counts(DESeq.ds[PanReactive, ])
                    
                    A1Specific <- DEGs[1:12,2]
                    A1Counts <- counts(DESeq.ds[A1Specific, ])
                    
                    A2Specific <- DEGs[1:11,3]
                    A2Counts <- counts(DESeq.ds[A2Specific, ])

                    Pan Reactive Genes Up-regulated in SPP1_KO samples

                    for(gene in PanReactive){
                      par(mfrow=c(1,2))
                      plotCounts(DESeq.ds, gene=gene, normalized = TRUE, xlab="STAR")
                      plotCounts(salmon_DESeq.ds, gene = gene, normalized = TRUE, xlab="Salmon", main = gene)
                    }

                    A1-Specific Genes Up-regulated in SPP1_KO samples

                    for(gene in A1Specific){
                      par(mfrow=c(1,2))
                      plotCounts(DESeq.ds, gene=gene, normalized = TRUE, xlab="STAR")
                      plotCounts(salmon_DESeq.ds, gene = gene, normalized = TRUE, xlab="Salmon", main = gene)
                    }

                    No particular trend observed in A2-Specific Genes in SPP1_KO samples

                    for(gene in A2Specific){
                      par(mfrow=c(1,2))
                      plotCounts(DESeq.ds, gene=gene, normalized = TRUE, xlab="STAR")
                      plotCounts(salmon_DESeq.ds, gene = gene, normalized = TRUE, xlab="Salmon", main = gene)
                    }

                    The individual gene count-plots confer with the results from the study performed.

                    Discussion

                    A brief description/list of issues/problems/limitations you encountered along the way and how you addressed them.

                    1. While creating an index using the “knowngene” version of the .gtf file lead to counting just the transcripts and not the genes since this file did not contain any gene_ids. I dealt with this problem by re-indexing the mm10 genome for both the algorithms using the ncbiRefSeq.gtf file from the UCSC table browser.
                    2. Performing the Gene Ontology analysis on the DESeq data turned out to be misleading since I did not apply the log2FC cutoff of 1.5. I corrected this by filtering out genes which have less than 1.5 log2FC.
                    3. Adding the log2FC filter did not help produce identical pathway enrichment results displayed in the paper.

                    A table that summarizes the key data sets that you have generated during the analyses and decided to keep.